Flexible Microprocessor Register File

ABSTRACT

Architectures and methods for viewing data in multiple formats within a register file. Various disclosed embodiments allow a plurality of consecutive registers within one register file to appear to be temporarily transposed by one instruction, such that each transposed register contains one byte or word from multiple consecutive registers. A program can arbitrarily reorganize the bytes within a register by swapping the value stored in any byte within the register with the value stored in any other byte within the same register. Indirect register access is also provided, without additional scoreboarding hardware, as an apparent move from one register to another. The functionality of a hardware data FIFO at the I/O is also provided, without the power consumption of register-to-register transfers. In addition, the size of the FIFO can be changed under program control.

BACKGROUND AND SUMMARY

The present application relates to programmable circuits, and more particularly to I/O circuitry with selectable data reordering for graphics.

A vector processor or array processor is a CPU design that is able to run mathematical operations on multiple data elements simultaneously. A serial vector is a sequence of data held in registers that are processed by the same instruction. For example, a single instruction may cause four registers to be added to another four and the result written to a further four. A parallel vector holds several data items within the same register, each of which has the same instruction applied to it. Vector processing improves code density and allows optimizations that improve performance.

A common problem suffered by vector processors is the need to organize data within the register file such that the same instruction may be applied to a series of registers. Register files generally only allow simultaneous access to a set of values aligned along a particular direction, i.e., along a row of the vector. Accordingly, a single instruction can access multiple values for a horizontal operation, but a vertical operation requires either transposing the array being operated on or performing separate access operations for each value in a different row. It is common to spend several instructions re-arranging data to make it suitable for vector processing, and this overhead may obviate the benefits of using a vector.

In view of these limitations, more efficient architectures and methods for performing transpose and other array manipulations are desired.

Yet another problem arises when a program instruction indirectly accesses a register. Microprocessors control programs' access to register files. Because of pipelining, some instructions must be stalled until the register from which they will read has been written to by another instruction. Scoreboarding stalls these instructions, so the program need not manage stalling. The stall condition is usually applied early in the execution pipeline. However, if a register is to be accessed indirectly by a program instruction, the register may not be known until it is too late, that is, until after the stall condition would normally have already been applied. Without knowing the register at that earlier time, it is difficult to apply stall conditions for instructions that use indirect access.

The inventions disclosed in the present application provide mechanisms to handle indirect register access without additional scoreboarding hardware, and can be further used to build a flexible FIFO access mechanism.

Flexible Register I/O Architecture

The present application discloses a register file input/output configuration in which a variety of data transpositions are available at minimum power. Power is conserved by avoiding register-to-register data transfers; instead, the sequencer provides executable microinstructions which imply a variety of apparent data formats (as seen by the data channel), without unnecessary physical transfers of data.

Various disclosed embodiments provide new ways for microprocessor register-files to be accessible in multiple formats, in order to reduce the number of program instructions required during byte, word and long word data reformatting. The disclosed innovations, in various embodiments, provide one or more of at least the following advantages:

Variety of data rearrangements;

Minimal power consumption;

Easy accommodation to special data reordering for digital signal processing operations;

Suitability to customized access to data with two-dimensional structure;

Suitability to customized access to data with multidimensional structure.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosed inventions will be described with reference to the accompanying drawings, which show important sample embodiments of the invention and which are incorporated in the specification hereof by reference, wherein:

FIG. 1 shows how four consecutive registers are viewed with byte-transpose enabled. Each row in the diagram represents one register as viewed by a program. When byte-transpose is enabled, the register file is effectively rotated by 90°, so that Register 0 contains all the low-bytes of the four registers, Register 1 contains all the second-bytes of the four registers, and so on.

FIG. 2 shows how two consecutive registers are viewed with word-transpose enabled. Each row in the diagram represents one register as viewed by a program. When word-transpose is enabled, the register file is effectively rotated by 90°, so that Register 0 contains all the low-words of the two registers, and Register 1 contains all the high-words of the two registers.

FIG. 3 shows the data in Register 0 being byte swapped in two different ways. The first is a full (DCBA) byte-swap, in which the original data-bytes are swapped within the entire 32-bit word, and the second is a BADC byte-swap, which swaps the bytes within each word.

FIGS. 5 a-5 g are a set of related drawings. FIG. 5 a shows a sample hardware register configuration, in which the register is separated into multiple multiport RAMs, each having multiplexers connected to each of its data lanes. FIGS. 5 b-5 g show different states of operation of this register: FIG. 5 b shows the routing needed for a 32-bit word at address 0 without transpose; FIG. 5 c shows routing for address 1 without transpose; FIG. 5 d shows the routing needed for the first 32 bits of an eight bit transpose; FIG. 5 e shows the routing for the second 32 bits of an eight bit transpose; FIG. 5 f shows the routing for address 0 with a 16-bit transpose, in this sample implementation; and FIG. 5 g shows routing for address 1 with a 16-bit transpose, in this sample implementation.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The numerous innovative teachings of the present application will be described with particular reference to the presently preferred embodiment (by way of example, and not of limitation).

Transposable Register-File Operation

The transposable register-file is a novel microprocessor register-file data organization scheme which overcomes many of the disadvantages of traditional data organization in a microprocessor register-file. It allows a microprocessor register-file to be viewed in multiple formats, which reduces the number of program instructions required during byte, word and long word data reformatting. The preferred embodiment supports both byte-transpose and word-transpose.

Byte-Transpose Register File

FIG. 1 shows how four consecutive registers are viewed with byte-transpose enabled. With reference to FIG. 1, the left hand side (110) illustrates the registers before transpose is enabled and the right hand side (120) illustrates the same registers after transpose is enabled. Each row in FIG. 1 represents one register as viewed by a program. For instance, bottom row (111) shows Register 0 before transpose is enabled. Each register in turn is composed of four bytes, with the left most (for instance 0a) being the lowest byte and the right most (for instance 0d) being the highest byte. When byte-transpose is enabled, the register file is effectively rotated by 90°, so that Register 0 (121) contains all the low-bytes of the four registers, Register 1 (122) contains all the second-bytes of the four registers, and so on.
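The byte-transposed view can be illustrated with a short software model. This is only a sketch under stated assumptions: the function names are hypothetical, each register is modeled as a 32-bit integer with byte 0 as the lowest byte, and the ordering of bytes inside each transposed register (byte i taken from register i) is assumed rather than taken from FIG. 1. The hardware described here produces this view by routing, not by copying data.

```python
def get_byte(word, k):
    """Return byte k (0 = lowest) of a 32-bit word."""
    return (word >> (8 * k)) & 0xFF


def byte_transpose_view(regs):
    """Model the view of four consecutive registers with byte-transpose on.

    Assumption: transposed register j holds byte j of registers 0..3,
    with the byte from register i placed in byte position i.
    """
    assert len(regs) == 4
    view = []
    for j in range(4):
        word = 0
        for i in range(4):
            word |= get_byte(regs[i], j) << (8 * i)
        view.append(word)
    return view


# Four registers whose bytes are labeled <register><byte>, e.g. 0x1B is
# byte "b" of register 1.
regs = [0x0D0C0B0A, 0x1D1C1B1A, 0x2D2C2B2A, 0x3D3C3B3A]
view = byte_transpose_view(regs)
print([hex(v) for v in view])
```

Applying the same function to the transposed view returns the original registers, which is why a single mode can serve for both directions of the reformatting.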

Word-Transpose Register File

Word-transpose is similar to byte-transpose, except that the register data is rotated on a per word basis instead of a per byte basis. FIG. 2 shows how two consecutive registers are viewed with word-transpose enabled. With reference to FIG. 2, the left hand side (210) illustrates the registers before transpose is enabled and the right hand side (220) illustrates the same registers after transpose is enabled. Each row in FIG. 2 represents one register as viewed by a program. For instance, bottom row (211) shows Register 0 before transpose is enabled. Each register in turn is composed of two words, with the left most (for instance 0a) being the low word and the right most (for instance 0b) being the high word. When word-transpose is enabled, the register file is effectively rotated by 90°, so that Register 0 (221) contains all the low-words of the two registers, and Register 1 (222) contains all the high-words of the two registers.
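A corresponding sketch for word-transpose treats a pair of registers as a 2×2 array of 16-bit words. As before, the function name and the placement of words inside each transposed register (the word from Register 0 in the low half) are assumptions made for illustration.

```python
def word_transpose_view(regs):
    """Model the view of two consecutive registers with word-transpose on.

    Assumption: transposed register 0 holds both low words and transposed
    register 1 holds both high words, with register 0's word in the low half.
    """
    assert len(regs) == 2
    low = [r & 0xFFFF for r in regs]
    high = [(r >> 16) & 0xFFFF for r in regs]
    return [low[0] | (low[1] << 16), high[0] | (high[1] << 16)]


# Words are labeled <register><word>: 0x001B is word "b" of register 1.
pair = [0x000B000A, 0x001B001A]
pair_view = word_transpose_view(pair)
print([hex(v) for v in pair_view])
```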

Register-File Byte-Mapping and Byte-Masking

The register-file byte-mapping and byte-masking functions add further flexibility to the novel microprocessor register-file data organization scheme. This feature of the disclosed inventions allows a program to arbitrarily reorganize the bytes within a register, and further reduces the number of program instructions required during byte, word and long word data reformatting.

Register-File Byte-Mapping

Byte-mapping allows a program to arbitrarily reorganize the bytes within a register in order to isolate, or group, interesting sub-components when reading from, or writing to, the register-file. FIG. 3 shows two examples of byte-mapping on a register. With reference to FIG. 3, the left hand side (310) illustrates the registers before byte-mapping and the right hand side (320) illustrates the same registers after byte-mapping. Each row in FIG. 3 represents one register as viewed by a program. For instance, bottom row (311) shows the register before byte-mapping. Each register in turn is composed of four bytes, with the left most (for instance 0a) being the lowest byte and the right most (for instance 0d) being the highest byte. When a byte-mapping of full (DCBA) byte swap is enabled, the original data-bytes are swapped within the entire 32-bit word, and the bytes in register (312) are reorganized as bytes in register (322). When a byte-mapping of (BADC) byte swap is enabled, the original data-bytes are swapped within each word, and the bytes in register (311) are reorganized as bytes in register (321).
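The byte-mapping modes named above can be sketched as a small function driven by a four-letter pattern. The encoding is an assumption chosen to match the mode names in this description: letters A to D name source bytes 0 to 3 (A being the lowest), and the position of each letter in the pattern names the destination byte, so that ABCD is the identity mapping. The function name is hypothetical.

```python
def byte_map(word, pattern):
    """Rearrange the four bytes of a 32-bit word according to a pattern.

    Assumed encoding: pattern[d] names the source byte (A..D = byte 0..3)
    that is presented in destination byte position d.
    """
    assert len(pattern) == 4
    out = 0
    for dest, letter in enumerate(pattern):
        src = "ABCD".index(letter)
        out |= ((word >> (8 * src)) & 0xFF) << (8 * dest)
    return out


word = 0x0D0C0B0A                      # bytes A=0x0A, B=0x0B, C=0x0C, D=0x0D
print(hex(byte_map(word, "DCBA")))     # full swap across the 32-bit word
print(hex(byte_map(word, "BADC")))     # swap within each 16-bit word
print(hex(byte_map(word, "DDDD")))     # replicate the highest byte
```

Patterns that repeat a letter, such as DDDD, replicate one byte into several positions, which is the form used in the blending example later in this description.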

Register-File Byte-Masking

The preferred embodiment supports both byte-mapping and byte-masking. Register-file byte-masking is another novel microprocessor register-file data organization scheme that provides control over the bytes that are modified by an instruction, in order to accelerate insertion of data into an existing register. The program may specify a byte-mask both for source operands and for destination operands. When a byte-mask is specified for source operands, parts of a register may be forced to zero on input to an instruction. When a byte-mask is specified for destination operands, the result of an instruction can be written to selected parts of a destination register.
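The two uses of a byte-mask can be sketched as follows. The 4-bit mask encoding (bit k enables byte k) and the function names are assumptions for illustration; the description above does not specify how a mask is encoded.

```python
def expand_mask(mask4):
    """Expand a 4-bit byte-enable mask into a 32-bit bit mask."""
    bits = 0
    for k in range(4):
        if mask4 & (1 << k):
            bits |= 0xFF << (8 * k)
    return bits


def mask_source(word, mask4):
    """Source masking: bytes that are not enabled read as zero."""
    return word & expand_mask(mask4)


def write_masked(dest, result, mask4):
    """Destination masking: only enabled bytes of dest are overwritten."""
    bits = expand_mask(mask4)
    return (dest & ~bits & 0xFFFFFFFF) | (result & bits)


print(hex(mask_source(0x44332211, 0b0001)))               # keep byte 0 only
print(hex(write_masked(0x44332211, 0xAABBCCDD, 0b0100)))  # insert byte 2 only
```

The second call shows the data-insertion case: one byte of a result is written into an existing register without a separate read, mask, and merge sequence.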

Indirect Register Access

The disclosed mechanism provides indirect register access without additional scoreboarding hardware. It provides two types of instructions: one for moving data from one register to another register, and another for synchronization.

The instruction format for moving data specifies the following parameters: a register that holds the source data, a register that holds either the destination register or the index of the destination register, and optionally a count of the number of registers to transfer. If the destination register is directly referenced in the instruction, those registers directly referenced in the instruction are scoreboarded when the instruction is executed. However, if the destination register is not directly referenced in the instruction, those registers indirectly referenced in the instruction are not scoreboarded when the instruction is executed, and a synchronization instruction is used to ensure that the data in the indirectly accessed register is correct.

In a typical use of this invention, a programmer uses a number of registers as scratchpad memory. Data is loaded into the scratchpad. If there is a switch from a direct to indirect access of register or vice versa, a synchronization instruction is issued to calculate an index into the scratchpad and the contents of the register at that index are copied into a known register. At this point all processing elements may use the same instruction to process data at the same register index. When the calculation is complete, the result may be copied back to the scratchpad and another synchronization instruction is issued to calculate the index.

Implementation of Hardware Data FIFO in Register-File

The provision of a hardware data FIFO in the microprocessor register-file uses ideas similar to those of indirect register access. This innovative feature, in the preferred embodiment, sets aside a number of registers from the microprocessor register-file for the FIFO storage, and provides a mechanism for moving data into the FIFO from another source, and for moving data serially out of the FIFO into other registers within the microprocessor. It has the benefits of:

-   Building the FIFO in the processor register file allows those registers to be re-used as normal registers when the FIFO is not needed.
-   The invention allows the size of the FIFO (and thereby the number of reserved registers) to be changed under program control.
-   It solves the indirect access problem in a hardware register FIFO implementation.
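A software model can illustrate the idea of a FIFO that lives in a reserved, resizable window of the register file. Everything named here (the class, its methods, the circular head/count bookkeeping, and the rule that the window may only be changed while the FIFO is empty) is a hypothetical sketch; the description above states only that registers are set aside, that data is moved in and serially out, and that the size is under program control.

```python
class RegisterFileFifo:
    """Toy model of a FIFO held in a reserved window of a register file."""

    def __init__(self, num_regs=256):
        self.regs = [0] * num_regs
        self.base = 0      # first reserved register
        self.size = 0      # number of reserved registers (0 = FIFO unused)
        self.head = 0      # offset of the oldest entry within the window
        self.count = 0     # number of valid entries

    def configure(self, base, size):
        """Change the reserved window; assumed legal only when empty."""
        if self.count:
            raise RuntimeError("FIFO must be empty to be reconfigured")
        if base < 0 or size < 0 or base + size > len(self.regs):
            raise ValueError("window does not fit in the register file")
        self.base, self.size, self.head = base, size, 0

    def push(self, value):
        """Move a value from another source into the FIFO."""
        if self.count == self.size:
            raise OverflowError("FIFO full")
        tail = (self.head + self.count) % self.size
        self.regs[self.base + tail] = value & 0xFFFFFFFF
        self.count += 1

    def pop_to(self, dest_reg):
        """Move the oldest entry out of the FIFO into a normal register."""
        if self.count == 0:
            raise IndexError("FIFO empty")
        self.regs[dest_reg] = self.regs[self.base + self.head]
        self.head = (self.head + 1) % self.size
        self.count -= 1


rf = RegisterFileFifo()
rf.configure(base=16, size=4)      # reserve registers 16..19
for value in (1, 2, 3):
    rf.push(value)
rf.pop_to(0)                       # register 0 receives the oldest entry
rf.pop_to(1)
rf.pop_to(2)
rf.configure(base=16, size=8)      # grow the FIFO under program control
```

Setting the size to zero in this model returns every register to ordinary use, which corresponds to the first benefit listed above.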

Example of Use of Transpose and Byte-Mapping

Pixel data is often stored in what is called the RGBA8888 format, in which each pixel is made up of red, green, blue, and alpha components, each of 8 bits. All four components are packed into one 32-bit word for convenience of display.

In common algorithms such as blending, the alpha component is used to modify the color components as follows:

dstR = (srcR * srcA) + dstR
dstG = (srcG * srcA) + dstG
dstB = (srcB * srcA) + dstB

Sample assembler code for this algorithm is:

mul tmp[0], src[0], src[3]
mul tmp[1], src[1], src[3]
mul tmp[2], src[2], src[3]
add dst[0], tmp[0], dst[0]
add dst[1], tmp[1], dst[1]
add dst[2], tmp[2], dst[2]

where the syntax is instruction, destination, source A, source B. The array indices refer to the byte position in the pixel.

The code may be reduced if a parallel vector is used, but the alpha component must be repeated in each byte of a 32-bit register. This can be done using a byte swap mode:

set byte swap mode for srcB to DDDD
mul tmp, src, src
reset byte swap mode for srcB to ABCD
add dst, dst, tmp

Note that this code only produces 3 bytes of results even though the registers hold 4 bytes. If 4 pixels are processed as a serial vector, this inefficiency can be removed:

transpose srcA
transpose srcB
vector_3_mul tmp, src, src
transpose dst
vector_3_add dst, dst, tmp

Transposing srcA causes all the red components to be in one register, all the green in another, and all the blue in a third. Transposing srcB causes all the alpha components to be in one register. Vector instructions of length three cause four pixels to be processed in 3 instructions (the stride of the srcB vector must be zero to use the same alpha value for each component).
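The transposed blend can be followed step by step in a small software model. Several details are assumptions that the description above does not fix: red is taken to be byte 0 and alpha byte 3 of each pixel, the per-lane multiply is modeled as an 8-bit fixed-point product (a * b) >> 8, and the per-lane add saturates at 255. All function names are hypothetical.

```python
def lanes(word):
    """Split a 32-bit word into four bytes, lowest first."""
    return [(word >> (8 * k)) & 0xFF for k in range(4)]


def pack(values):
    """Pack four bytes, lowest first, into a 32-bit word."""
    return sum((v & 0xFF) << (8 * k) for k, v in enumerate(values))


def transpose4(regs):
    """4x4 byte transpose of four registers (rows become columns)."""
    rows = [lanes(r) for r in regs]
    return [pack([rows[i][j] for i in range(4)]) for j in range(4)]


def pmul(a, b):
    """Per-lane multiply; the >> 8 scaling is an assumed fixed-point rule."""
    return pack([(x * y) >> 8 for x, y in zip(lanes(a), lanes(b))])


def padd(a, b):
    """Per-lane add; saturation at 255 is an assumption."""
    return pack([min(x + y, 255) for x, y in zip(lanes(a), lanes(b))])


def blend4(src, dst):
    """Blend four RGBA pixels using transposed (planar) registers."""
    s = transpose4(src)            # s[0]=RRRR, s[1]=GGGG, s[2]=BBBB, s[3]=AAAA
    d = transpose4(dst)
    for c in range(3):             # vector of length three over R, G, B
        d[c] = padd(pmul(s[c], s[3]), d[c])   # alpha operand has stride zero
    return transpose4(d)           # back to one pixel per register


src = [0xFF204080] * 4             # A=0xFF, B=0x20, G=0x40, R=0x80
dst = [0x00000000] * 4
out = blend4(src, dst)
print([hex(p) for p in out])
```

All four byte lanes do useful work in each of the three multiply and add steps, whereas the DDDD version above leaves the alpha lane idle.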

Register File Implementation

Details of a sample implementation will now be described. In this implementation, the register file is used for all storage within the processing element and holds a generous 256 registers, each 32 bits wide. The registers are perhaps more important to overall system performance than the ALU because they control the movement of data, and a SIMD array typically has high compute performance relative to data bandwidth. The register file can be large because it absorbs a number of FIFOs that would normally be needed to feed the ALU. All registers are preferably scoreboarded, so any instruction that attempts to read a register that has a write scheduled for it will stall until the write completes.

Parallel Vectors

To make good use of the ALU, several data items may be packed into one register. The ALU may work on four 8-bit items at a time, or two 16-bit items, but the operation is always the same. This is similar to vector calculations, and when more than one item of data is held in a register it is referred to as a parallel vector (pvec as opposed to svec for vectors executed sequentially). Pvecs can boost performance if it is not too expensive to get data into an appropriate format.

An example of using pvecs is to take four pixels of red, green, blue, and alpha, and re-group them such that common components are in the same register (so grouped as RRRR, GGGG, BBBB, AAAA). Then different operations can be applied to each component at full speed (it is common for alpha to be processed differently than RGB). If you imagine the four pixels as a four by four array of bytes, the source format has RGBA in rows and the processing needs them in columns; getting into this format requires transposing the pvecs. After processing is complete the transpose needs to be reversed.

The register file supports zero-cost transposing for 8 or 16 bit pvecs. If the data type is 16 bits, the register set is treated as being in pairs and the transposition takes place assuming two registers hold a 2×2 array of data. If the data type is 8 bits, then four registers are assumed to hold a 4×4 array of data. Transposition is free because the register file is made up of four separate RAMs, which gives access to four different registers at the same time. The lower bits of the register address select the bytes to use, so registers to be transposed must be in sequential registers and must be aligned to the number of registers that will be transposed.
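The addressing rule can be sketched as a routing table that says, for each output byte lane, which RAM, which row, and which RAM lane supplies the byte. The sketch assumes register r is stored in RAM r mod 4 at row r div 4, and assumes a particular ordering of output lanes; the function names and the mode strings are hypothetical.

```python
def routing(reg, mode="none"):
    """Return (ram, row, ram_lane) for each of the four output byte lanes.

    Assumptions: register r lives in RAM r % 4 at row r // 4; the low
    address bits pick the byte (or word) column during a transpose.
    """
    row = reg // 4
    if mode == "none":
        return [(reg % 4, row, lane) for lane in range(4)]
    if mode == "byte":             # 4x4 transpose: one byte from each RAM
        k = reg % 4
        return [(ram, row, k) for ram in range(4)]
    if mode == "word":             # 2x2 transpose: one word from each of a pair
        pair_base, k = reg & ~1, reg & 1
        return [((pair_base + half) % 4, row, 2 * k + b)
                for half in range(2) for b in range(2)]
    raise ValueError("unknown mode")


def read(rams, reg, mode="none"):
    """Assemble a 32-bit word by following the routing table."""
    word = 0
    for out_lane, (ram, row, lane) in enumerate(routing(reg, mode)):
        word |= ((rams[ram][row] >> (8 * lane)) & 0xFF) << (8 * out_lane)
    return word


# One row per RAM, holding registers 0..3.
rams = [[0x0D0C0B0A], [0x1D1C1B1A], [0x2D2C2B2A], [0x3D3C3B3A]]
print(hex(read(rams, 2)))            # plain read of register 2
print(hex(read(rams, 0, "byte")))    # low bytes of registers 0..3
print(hex(read(rams, 1, "word")))    # high words of registers 0 and 1
```

Because the group of registers is selected by discarding the low address bits, a group that is not aligned would straddle two rows, which is the reason for the alignment requirement stated above.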

Transposition also allows efficient memory access for 24 bit components. If data is stored byte-planar, with four bytes of each component stored in the same 32 bit word, the layout would be as shown in FIG. 4. This is a useful way to store 24 bit data because there is no wastage, but neither is there a difficult address calculation or nasty data shifting. In some algorithms it is convenient to process the components individually, but in others the whole pixel may be needed. Transposition allows this byte planar format to be converted into 32 bit pixels.

The register file has, in principle, three read ports and two write ports. Two of the read ports are used by the ALU, as is one of the write ports. The remaining read and write ports are used to get memory data in and out of the registers. For best performance the RAM used to build the register file should have all five ports, but that will make it large. A compromise is possible in which one read and one write port are removed.

Because the register file is made up of four separate RAMs for transposition, it is possible to arrange accesses to them so that while the ALU accesses one RAM another can be used for memory data. The vector operations result in the registers being accessed in a predictable pattern. The trick is to arrange the addressing so that memory accesses follow the same pattern as vector operations, but staggered so that they don't use the same RAM at the same time. This is not always possible when transposing because the ALU may need access to all four RAMs. When there is contention for the register file the memory wins and the ALU stalls (this is the cost of not having all 5 ports).

Indirect Accesses

Indirect register access allows the contents of one register to form the index to another. It is obviously useful for histograms, but also for FFT data shuffling and median filtering. It is difficult to implement because all PEs may access different registers, which breaks the SIMD model and requires additional scoreboarding hardware.

The media processor imposes a slight restriction that avoids the hardware cost. Special instructions are used to copy data from one register to another; the register to copy from (or to) is specified in another register. The restriction is that while indirection is in use any register that may be indirectly accessed must not be used directly. This removes the need to scoreboard the indirectly accessed register, while the directly accessed register is scoreboarded to ensure correct operation. The cost is an extra instruction per indirection.
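The histogram case can be sketched with two hypothetical copy instructions, one that reads through an index register and one that writes through it. The register numbers, the scratchpad location, and the function names are all assumptions for illustration. The model is purely functional: it does not simulate pipelining, scoreboarding, or the synchronization instruction, and shows only the restriction that scratchpad registers are touched solely through the indirect copies.

```python
SCRATCH_BASE = 32      # assumed start of the scratchpad (bins) in the register file
NUM_BINS = 8
IDX, TMP = 0, 1        # directly accessed registers: the index and a known register


def load_indirect(regs, dest, index_reg):
    """Copy the register whose number is held in index_reg into dest."""
    regs[dest] = regs[regs[index_reg]]


def store_indirect(regs, src, index_reg):
    """Copy src into the register whose number is held in index_reg."""
    regs[regs[index_reg]] = regs[src]


def histogram(samples):
    """Count samples into bins held in scratchpad registers."""
    regs = [0] * 256
    for s in samples:
        regs[IDX] = SCRATCH_BASE + (s % NUM_BINS)   # compute the index
        load_indirect(regs, TMP, IDX)               # bin -> known register
        regs[TMP] += 1                              # ordinary direct operation
        store_indirect(regs, TMP, IDX)              # known register -> bin
    return regs[SCRATCH_BASE:SCRATCH_BASE + NUM_BINS]


bins = histogram([0, 1, 1, 7, 9])
print(bins)
```

Each update costs the two extra copy instructions around the increment, which reflects the per-indirection cost noted above.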

Details of Sample Hardware Implementation

FIGS. 5 a-5 g are a set of related drawings, which collectively show a sample hardware implementation and its various operational modes.

FIG. 5 a shows a sample hardware register configuration, in which the register is separated into multiple multiport RAMs 510, each having multiplexers 520 connected to each of its data lanes. Four RAMs may be connected to support transposing. Each RAM is 32 bits wide and shows four bytewide lanes. Each RAM holds every fourth entry in the register file. The dotted boxes are multiplexers that switch between the two inputs. This hardware implementation permits all of the above functional relationships to be realized.

The multiplexers can be, for example, simple by-8 circuits having two states, selected by a single control bit (per multiplexer). These control bits can be set, for example, by appropriate configuration instructions.

FIGS. 5 b-5 g show different states of operation of this register. In these diagrams, only the active inputs to active multiplexers 520 are shown.

FIG. 5 b shows the routing needed for a 32-bit word at address 0 without transpose, in this sample implementation.

FIG. 5 c shows routing for address 1 without transpose, in this sample implementation.

FIG. 5 d shows the routing needed for the first 32 bits of an eight bit transpose; the lower byte of each RAM is connected to a different byte lane, in this sample implementation.

FIG. 5 e shows the routing for the second 32 bits of an eight bit transpose, in this sample implementation.

FIG. 5 f shows the routing for address 0 with a 16-bit transpose, in this sample implementation.

FIG. 5 g shows routing for address 1 with a 16-bit transpose, in this sample implementation.

This hardware implementation can of course be varied, but this shows how an extremely versatile set of output reordering options can be achieved by multiplexing, WITHOUT unnecessary register-to-register transfers (which consume power).

Additional detail of the preferred implementation is shown in U.S. application Ser. No. 11/536,483, which is hereby incorporated by reference in its entirety. This implementation is an advantageous context for the disclosed inventions, but it should be emphasized that the I/O architecture described in the present application can also be used in many other contexts.

According to a disclosed class of innovative embodiments, there is provided: A method of selectably transposing data accessed in a register, comprising the actions of: storing data in n memory segments, each having n data lanes at the output thereof; and selectably connecting each of n data bus segments to a respective one of said n² data lanes; whereby a desired data transposition is provided at the time of register access without register-to-register transfers.

According to a disclosed class of innovative embodiments, there is provided: An electronic system, comprising: a logic unit; and at least one I/O register, comprising multiple memory segments each holding a respective fraction of a data set, said data set being distributed across said segments in a consistent pattern, and each said memory segment providing multiple lanes of data path; and multiple multiplexers, each connected to connect a respective output bus segment to a respective data path of a respective one of said memory segments.

Modifications and Variations

As will be recognized by those skilled in the art, the innovative concepts described in the present application can be modified and varied over a tremendous range of applications, and accordingly the scope of patented subject matter is not limited by any of the specific exemplary teachings given.

For example, the multiple access modes provided by the disclosed embodiments are particularly useful for graphics and image processing; they can also be especially useful for data which has internal 3-D or 4-D structure (e.g. a time series of voxel images). In such cases the capability for customized data transpositions can help with filtering and transformations.

For another example, a flexible register can optionally implement some but not all of the transpositions described above, and/or can implement additional transpositions besides those listed.

For another example, the disclosed hardware implementation uses byte-wide “lanes”, but alternatively and less preferably a different fineness can be used. If fast nibble transpositions are desired, 8 RAMs could be used instead of four, with 8 lanes instead of four on each RAM, and 8 output busses instead of four. Note, however, that the number of multiplexers would quadruple if this were done.

For another alternative and less preferable example, more logic can be added into the multiplexers if desired. For instance, the multiplexers can be given additional states wherein the 8-bit output is not only connected to a selected input (or none), but wherein the bits of the input can be permuted, pairwise exchanged, complemented, ANDed, etc. Additional control bits would preferably be routed to the multiplexers in such cases.

None of the description in the present application should be read as implying that any particular element, step, or function is an essential element which must be included in the claim scope: THE SCOPE OF PATENTED SUBJECT MATTER IS DEFINED ONLY BY THE ALLOWED CLAIMS. Moreover, none of these claims are intended to invoke paragraph six of 35 USC section 112 unless the exact words “means for” are followed by a participle.

The claims as filed are intended to be as comprehensive as possible, and NO subject matter is intentionally relinquished, dedicated, or abandoned.

What is claimed is:
 1. A method of selectably transposing data accessed in a register, comprising the actions of: storing data in n memory segments, each having n data lanes at the output thereof; and selectably connecting each of n data bus segments to a respective one of said n² data lanes; whereby a desired data transposition is provided at the time of register access without register-to-register transfers.
 2. The method of claim 1, wherein each said data lane carries 8 bits of data.
 3. The method of claim 1, wherein said selectably connecting step is performed by activating only n of a total of n² multiplexers.
 4. The method of claim 1, wherein n=4.
 5. A method of viewing data within a register file, comprising the steps of: enabling a transpose function with respect to a selected register; and modifying a view of the selected register, as seen by an external access, such that data in the selected register is replaced with data from a plurality of consecutive registers.
 6. The method of claim 5, wherein said modifying step effectively rotates the apparent orientation of data in said selected register.
 7. The method of claim 5, wherein said modifying step effectively applies bytewise transposition to said view.
 8. The method of claim 5, wherein said modifying step effectively applies wordwise transposition to said view.
 9. A method of viewing data within a register file, comprising the steps of: identifying a first source byte in a register; copying data from the first source byte into a first destination byte of the register; identifying a second source byte in the register; and copying data from the second source byte into a second destination byte of the register.
 10. An electronic system, comprising: a logic unit; and at least one I/O register, comprising multiple memory segments each holding a respective fraction of a data set, said data set being distributed across said segments in a consistent pattern, and each said memory segment providing multiple lanes of data path; and multiple multiplexers, each connected to connect a respective output bus segment to a respective data path of a respective one of said memory segments. 